
    Slovene and Croatian word embeddings in terms of gender occupational analogies

    In recent years, the use of deep neural networks and dense vector embeddings for text representation has led to excellent results in the computational understanding of natural language. It has also been shown that word embeddings often capture gender, racial and other types of bias. The article evaluates Slovene and Croatian word embeddings in terms of gender bias using word analogy calculations. We compiled a list of masculine and feminine nouns for occupations in Slovene and evaluated the gender bias of fastText, word2vec and ELMo embeddings with different configurations and different approaches to analogy calculations. The lowest occupational gender bias was observed with the fastText embeddings. Similarly, we compared different fastText embeddings on Croatian occupational analogies.
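The abstract does not say which analogy formulas were compared. As a minimal sketch of one widely used variant, the vector-offset method (often called 3CosAdd), the block below completes "a is to b as c is to ?" by cosine similarity. The two-dimensional vectors are invented toy values, not real fastText, word2vec or ELMo embeddings, and the seed pair moški/ženska [en. man/woman] is an assumption chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def analogy(vectors, a, b, c):
    """Return the word d that best completes a : b :: c : d.

    Uses the vector-offset method: d maximizes cos(d, b - a + c),
    with the three query words excluded from the candidates.
    """
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Toy vectors, invented for illustration only (not real embeddings).
toy_vectors = {
    "moški": [1.0, 0.0],
    "ženska": [1.0, 1.0],
    "fotograf": [3.0, 0.0],
    "fotografka": [3.0, 1.0],
    "dramatik": [2.0, -1.0],
}

result = analogy(toy_vectors, "moški", "ženska", "fotograf")
print(result)
```

With real embeddings, a gender-bias evaluation of this kind would check how often the top-ranked answer for a masculine occupation is its feminine equivalent, and vice versa.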

    List of single-word male and female occupations in Slovenian

    The list of single-word occupations in Slovene is based on the Slovene Standard Classification of Occupations (https://www.uradni-list.si/glasilo-uradni-list-rs/vsebina?urlid=199728&stevilka=1641). The list includes 234 occupation pairs. For each occupation, it contains the masculine word form (e.g. fotograf), its possible synonym, the feminine equivalent (e.g. fotografka) and the corresponding synonym of the feminine form (e.g. fotografinja). Cases where no synonym was added for a specific occupation are denoted with the label 0 (note that only synonyms with the same root are considered).

    Several conditions for including an occupation in the list, or excluding it, were applied:
    - Our list contains only single-word occupation pairs, while the majority of the occupations in the aforementioned classification are multi-word expressions.
    - An occupation has to exist in both feminine and masculine grammatical gender (gender-neutral words such as pismonoša [en. postman] are not included in the list).
    - At least one of the variants of an occupation (masculine or feminine) occurs at least 500 times in the Corpus of Written Standard Slovene Gigafida 2.0.
    - Occupations that are also proper names in Slovene, e.g. kovač [en. blacksmith], were filtered out if the proper-name form exists in the Slovene Morphological Lexicon Sloleks 2.0 (Dobrovoljc et al., 2019).
    - Occupations that could easily be associated with a context unrelated to occupations (e.g. čarovnik/čarovnica [en. wizard/witch]), or where the masculine or feminine variant is a homograph of a common noun (e.g. detektivka [en. detective] also denotes a detective novel), were excluded from the final set of occupations.

    When a more established version of an occupation exists, we manually add a synonym with the same root (e.g. in the case of fotografka, an arguably more established fotografinja was added [en. photographer]). If the standard classification does not include the feminine (e.g. dramatik [en. playwright]) or the masculine version (e.g. prostitutka [en. prostitute]) of an occupation, the missing version is added manually if it exists and appears in the Gigafida corpus. Where no such word exists, nothing is added: for example, there is no established feminine version of postrešček [en. porter] or masculine version of hostesa [en. hostess].

    The list of occupations can be used for different natural language processing tasks, including the evaluation of word embedding models through analogies, which can point to bias in language use.

    If you use the dataset, please cite the following paper: SUPEJ, Anka, ULČAR, Matej, ROBNIK ŠIKONJA, Marko, POLLAK, Senja (2020). Primerjava slovenskih besednih vektorskih vložitev z vidika spola na analogijah poklicev. Zbornik konference Jezikovne tehnologije in digitalna humanistika / Proc. of the Conference on Language Technologies and Digital Humanities, pp. 93-100.
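The record structure and the frequency condition described above can be sketched as follows. This is only an illustration: the description does not state the file format, so the tab-separated layout, the column order, the names `OccupationPair`, `parse_row` and `frequent_enough`, and the corpus counts are all assumptions (the counts are hypothetical, not Gigafida 2.0 figures).

```python
from typing import NamedTuple, Optional

NO_SYNONYM = "0"      # label used in the list when no synonym was added
MIN_FREQUENCY = 500   # threshold from the inclusion conditions


class OccupationPair(NamedTuple):
    masculine: str
    masculine_synonym: Optional[str]
    feminine: str
    feminine_synonym: Optional[str]


def parse_row(line, sep="\t"):
    """Parse one row; the four-column, tab-separated layout is assumed."""
    masc, masc_syn, fem, fem_syn = (field.strip() for field in line.split(sep))
    return OccupationPair(
        masculine=masc,
        masculine_synonym=None if masc_syn == NO_SYNONYM else masc_syn,
        feminine=fem,
        feminine_synonym=None if fem_syn == NO_SYNONYM else fem_syn,
    )


def frequent_enough(pair, corpus_counts):
    """True if at least one variant reaches the frequency threshold."""
    masc_count = corpus_counts.get(pair.masculine, 0)
    fem_count = corpus_counts.get(pair.feminine, 0)
    return max(masc_count, fem_count) >= MIN_FREQUENCY


pair = parse_row("fotograf\t0\tfotografka\tfotografinja")
hypothetical_counts = {"fotograf": 1200, "fotografka": 40}  # invented numbers
print(pair, frequent_enough(pair, hypothetical_counts))
```

Mapping the label 0 to `None` keeps a missing synonym from being mistaken for a word when the pairs are later looked up in an embedding vocabulary.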